Abstract: Open-vocabulary 3D affordance detection is regarded as a critical link between high-level semantic understanding and low-level robotic manipulation, enabling precise localization of object functional regions in unstructured environments. However, existing methods mostly rely on frozen pre-trained vision-language models for shallow feature matching, and their generalization is limited by two challenges: semantic ambiguity in text instructions and geometry-semantic misalignment in the feature space. To address these issues, a synergistic semantic and geometric enhancement based affordance learning network (SSGE-Net) is proposed in this paper. First, a physics-aware semantic enhancement module is constructed to generate structured triplets of geometric constraints, functional descriptions and interaction logic; these triplets densify the semantics and compensate for the sparsity of the instruction information. Second, a multi-scale geometry refinement mechanism is designed, in which local dynamic graph convolution and global self-attention capture complementary topological details to enhance feature discriminability. Finally, a deep cross-modal alignment mechanism based on Transformer decoders is proposed: point cloud features are dynamically reconstructed by cross-attention under semantic guidance to achieve precise anchoring. Extensive experiments on the 3D AffordanceNet dataset demonstrate that SSGE-Net achieves consistent performance improvements under both full-view and partial-view settings, validating its superiority and robustness under complex viewpoints and in long-tail category scenarios.
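The cross-modal alignment step described above can be illustrated with a minimal sketch: a semantic (text) query attends over per-point features via scaled dot-product cross-attention, yielding attention weights that anchor the instruction to the relevant points. This is an illustrative single-head toy in plain Python, not the paper's implementation; the function names and toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(text_query, point_feats):
    """One cross-attention step: a text query attends over point features.

    text_query : list[float], length d       -- semantic query vector
    point_feats: list[list[float]], N rows x d -- per-point feature vectors
    Returns (weights over the N points, attended d-dim feature).
    """
    d = len(text_query)
    # Scaled dot-product scores between the query and each point feature.
    scores = [sum(q * k for q, k in zip(text_query, p)) / math.sqrt(d)
              for p in point_feats]
    weights = softmax(scores)
    # Attended feature: attention-weighted sum of point features.
    attended = [sum(w * p[j] for w, p in zip(weights, point_feats))
                for j in range(d)]
    return weights, attended
```

In a full model this single head would be replaced by a multi-head Transformer decoder layer, with the attended features feeding a per-point affordance head; the sketch only shows how semantic guidance reweights point features.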